-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Qorb integration as connection pool for database #5876
Conversation
// Yes, panicking in the above scenario sucks. But this type is already | ||
// pretty ubiquitous within Omicron, and integration with the qorb | ||
// connection pooling library requires access to database by SocketAddr. | ||
pub fn address(&self) -> SocketAddr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love that this method makes this assumption, though this appears to be embedded mostly within tests.
It's necessary within nexus/db-queries/src/db/pool.rs
to implement SingleHostResolver
, which is our way of identifying backends as SocketAddr
types.
I can try to make the "address" of the backend generic through qorb, so we could leave it as a URL?
@@ -210,7 +207,7 @@ impl ServerContext { | |||
// nexus in dev for everyone | |||
|
|||
// Set up DNS Client | |||
let resolver = match config.deployment.internal_dns { | |||
let (resolver, dns_addrs) = match config.deployment.internal_dns { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to pull out the DNS addresses here to help bootstrap the qorb pool below.
We will eventually be able to dynamically discover DNS servers with qorb, but we need to supply a value of https://github.com/oxidecomputer/qorb/blob/bb432c9574c393bf2eb7b19516b996aa5584573e/src/resolvers/dns.rs#L387-L390 first
nexus/db-queries/src/db/pool.rs
Outdated
query_interval: tokio::time::Duration::from_secs(10), | ||
hardcoded_ttl: Some(std::time::Duration::from_secs(60)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This hardcoded_ttl
config is load-bearing. The default is None
, and the DnsResolver
uses a TTL from the DNS records themselves to identify the the records as present or not. However, all our DNS records are being saved with a TTL of "zero", so they are immediately discarded.
/// A customizer for all new connections made to CockroachDB, from Diesel. | ||
#[derive(Debug)] | ||
pub(crate) struct ConnectionCustomizer {} | ||
/// A [backend::Connector] which provides access to [PgConnection]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole implementation is taken from https://github.com/oxidecomputer/qorb/blob/master/src/connectors/diesel_pg.rs -- I originally made the implementation there as an example, and to make sure I didn't break anything, but IMO it makes sense for Omicron to have total control over this (e.g., the layering, the control of full table scans, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks really nice! I left a few minor suggestions --- I hope they're helpful!
impl Resolver for SingleHostResolver { | ||
fn monitor(&mut self) -> watch::Receiver<AllBackends> { | ||
self.tx.subscribe() | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not a blocker or anything, but: it occurs to me that qorb
could provide an impl Resolver for watch::Receiver<AllBackends>
that just clones the rx, which would kind of obviate the need for this type..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmmm I suppose so, but the choice of resolver still needs to be caller-controlled. I think we could probably adjust the API of Pool::new
to take a watch::Receiver<AllBackends>
instead of a Boxed Resolver
, but they feel pretty similar functionally, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, this looks good to me! Up to you whether it's worth waiting for a review from someone else, as well.
let mut url = | ||
Url::parse(&format!("postgresql://{user}@{address}/{db}"))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i really thought there was a builder-style interface for constructing URLs from parts like this, but looking at the url
docs, i think it must be the other Rust URL library that has that...
This is still pretty much ready to go, but I'm interested in getting #5912 resolved first so we don't fragment our DNS client usage -- especially considering that we've been bumping into mysterious timeouts with hickory DNS that still need resolution. |
#5912 has now been resolved and the hickory-dns timeouts understood so I think this can move forward. |
Fixes #6505 , integrates usage of the new qorb APIs to terminate pools cleanly: oxidecomputer/qorb#45 # Background - [https://github.com/oxidecomputer/qorb](https://github.com/oxidecomputer/qorb) is a connection pooling crate, which spins up tokio task that attempt to connect to backends. - When `qorb` was integrated into Omicron in #5876, I used it to connect to our database backend (CockroachDB). This included usage in tests, even with a "single backend host" (one, test-only CRDB server) -- I wanted to ensure that we used the same pool and connector logic in tests and prod. # What Went Wrong As identified in #6505 , we saw some tests failing during **termination**. The specific cause of the failure was a panic from [async-bb8-diesel](https://github.com/oxidecomputer/async-bb8-diesel), where we attempted to spawn tasks with a terminating tokio executor. This issue stems from async-bb8-diesel's usage of `tokio::task::spawn_blocking`, where the returned `JoinHandle` is immediately awaited and **unwrapped**, with an expectation that "we should always be able to spawn a blocking task". There was a separate discussion about "whether or not async-bb8-diesel should be unwrapping in this case" (see: oxidecomputer/async-bb8-diesel#77), but instead, I chose to focus on the question: ## Why are we trying to send requests to async-bb8-diesel while the tokio runtime is exiting? The answer to this question lies in qorb's internals -- qorb itself spawns many tokio tasks to handle ongoing work, including monitoring DNS resolution, checking on backend health, and making connections before they are requested. One of these qorb tasks calling `ping_async`, which checks connection health, used the `async-bb8-diesel` interface that ultimately panicked. Within qorb most of these tokio tasks have a drop implementation of the form: ```rust struct MyQorbStructure { handle: tokio::task::JoinHandle<()>, } impl Drop for MyQorbStructure { fn drop(&mut self) { self.handle.abort(); } } ``` Tragically, without async drop support in Rust, this is the best we can do implicitly -- signal that the background tasks should stop ASAP -- but that may not be soon enough! Calling `.abort()` on the `JoinHandle` does not terminate the task immediately, it simply signals that it should shut down at the next yield point. As a result, we can still see the flake observed in #6505: - A qorb pool is initialized with background tasks - One of the qorb worker tasks is about to make a call to check on the health of a connection to a backend - The test finishes, and returns. The tokio runtime begins terminating - We call `drop` on the `qorb` pool, which signals the background task should abort - The aforementioned qorb worker task makes the call to `ping_async`, which calls `spawn_blocking`. This fails, because the tokio runtime is terminating, and returns a [JoinError::Cancelled](https://buildomat.eng.oxide.computer/wg/0/details/01J9YQVN7X5EQNXFSEY6XJBH8B/zfviqPz9RoJp3bY4TafbyqXTwbhqdr7w4oupqBtVARR00CXF/01J9YQWAXY36WM0R2VG27QMFRK#S6049). - `async-bb8-diesel` unwraps this `JoinError`, and the test panics. # How do we mitigate this? That's where this PR comes in. Using the new qorb APIs, we don't rely on the synchronous drop methods -- we explicitly call `.terminate().await` functions which do the following: - Use `tokio::sync::oneshot`s as signals to `tokio::tasks` that they should exit - `.await` the `JoinHandle` for those tasks before returning Doing this work explicitly as a part of cleanup ensures that there are not any background tasks attempting to do new work while the tokio runtime is terminating.
Replaces all usage of bb8 with a new connection pooling library called qorb.
qorb, detailed in RFD 477, provides the following benefits over bb8:
Fixes #4192
Part of #3763 (fixes CRDB portion)